Abstract: The effectiveness and scalability of MapReduce-
based implementations of complex data-intensive tasks depend
on an even redistribution of data between map and reduce 
tasks. In the presence of skewed data, sophisticated redistribution 
approaches thus become necessary to achieve load balancing 
among all reduce tasks to be executed in parallel. For the 
complex problem of entity resolution, we propose and evaluate 
two approaches for such skew handling and load balancing. The 
approaches support blocking techniques to reduce the search 
space of entity resolution, utilize a preprocessing MapReduce 
job to analyze the data distribution, and distribute the entities 
of large blocks among multiple reduce tasks. The evaluation on 
a real cloud infrastructure shows the value and effectiveness of 
the proposed load balancing approaches. 

I. Introduction 

Cloud computing [2] has become a popular paradigm for efficiently
processing computationally and data-intensive tasks.
Such tasks can be executed on demand on powerful distributed 
hardware and service infrastructures. The parallel execution of 
complex tasks is facilitated by different programming models, 
in particular the widely available MapReduce (MR) model [5] 
supporting the largely transparent use of cloud infrastructures. 
However, the (cost-) effectiveness and scalability of MR implementations
depend on effective load balancing approaches to
evenly utilize available nodes. This is particularly challenging
for data-intensive tasks where skewed data redistribution may
cause node-specific bottlenecks and load imbalances.

We study the problem of MR-based load balancing for the 
complex problem of entity resolution (ER) (also known as 
object matching, deduplication, record linkage, or reference 
reconciliation), i.e., the task of identifying entities referring 
to the same real-world object [13]. ER is a pervasive problem 
and of critical importance for data quality and data integration, 
e.g., to find duplicate customers in enterprise databases or to 
match product offers for price comparison portals. ER techniques
usually compare pairs of entities by evaluating multiple
similarity measures to make effective match decisions. Naive
approaches examine the complete Cartesian product of n input
entities. However, the resulting quadratic complexity of $O(n^2)$
is inefficient for large datasets even on cloud infrastructures.
The common approach to improve efficiency is to reduce 
the search space by adopting so-called blocking techniques 
[3]. They utilize a blocking key on the values of one or 
several entity attributes to partition the input data into multiple 
partitions (called blocks) and restrict the subsequent matching
to entities of the same block. For example, product entities
may be partitioned by manufacturer values such that only 
products of the same manufacturer are evaluated to find 
matching entity pairs. 
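
To make the effect of blocking concrete, the following Python sketch counts candidate pairs with and without a blocking key. The product records and the helper names (naive_pairs, blocked_pairs) are illustrative assumptions and are not taken from the evaluated implementation.

```python
from collections import defaultdict
from itertools import combinations

def naive_pairs(entities):
    """All unordered pairs of the input: n * (n - 1) / 2 comparisons."""
    return list(combinations(entities, 2))

def blocked_pairs(entities, blocking_key):
    """Only pairs whose entities share the same blocking key."""
    blocks = defaultdict(list)
    for entity in entities:
        blocks[blocking_key(entity)].append(entity)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs

# Hypothetical product records: (id, manufacturer).
products = [(1, "acme"), (2, "acme"), (3, "acme"),
            (4, "bolt"), (5, "bolt"), (6, "core")]

all_pairs = naive_pairs(products)
candidate_pairs = blocked_pairs(products, blocking_key=lambda p: p[1])
print(len(all_pairs), len(candidate_pairs))
```

In this toy input, blocking by manufacturer leaves only the pairs inside the "acme" and "bolt" blocks as candidates.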

Despite the use of blocking, ER remains a costly process 
that can take several hours or even days for large datasets 
[12]. Entity resolution is thus an ideal problem to be solved in 
parallel on cloud infrastructures. The MR model is well suited 
to execute blocking-based ER in parallel within several map 
and reduce tasks. In particular, several map tasks can read the 
input entities in parallel and redistribute them among several 
reduce tasks based on the blocking key. This guarantees that 
all entities of the same block are assigned to the same reduce 
task so that different blocks can be matched in parallel by 
multiple reduce tasks. 

However, such a basic MR implementation is susceptible
to severe load imbalances due to skewed block sizes since
the match work of entire blocks is assigned to a single reduce 
task. As a consequence, large blocks (e.g., containing 20% 
of all entities) would prevent the utilization of more than 
a few nodes. The absence of skew handling mechanisms 
can therefore tremendously deteriorate runtime efficiency and 
scalability of MR programs. Furthermore, idle but instantiated 
nodes may produce unnecessary costs because public cloud 
infrastructures (e.g., Amazon EC2) usually charge per utilized 
machine hours. 

In this paper, we propose and evaluate two effective load 
balancing approaches to data skew handling for MR-based 
entity resolution. Note that MR's inherent vulnerability to
load imbalances due to data skew is relevant for all kinds
of pairwise similarity computation, e.g., document similarity
computation [9] and set-similarity joins [19]. Such applications
can therefore also benefit from our load balancing approaches
though we study MR-based load balancing in the context of
ER only. In particular, we make the following contributions:

• We introduce a general MR workflow for load-balanced 
blocking and entity resolution. It employs a preprocessing 
MR job to determine a so-called block distribution matrix 
that holds the number of entities per block separated by 
input partitions. The matrix is used by both load balancing 
schemes to determine fine-tuned entity redistribution for 
parallel matching of blocks. (Section III) 

• The first load balancing approach, BlockSplit, takes the
size of blocks into account and assigns entire blocks to
reduce tasks if this does not violate load balancing or
memory constraints. Larger blocks are split into smaller
chunks based on the input partitions to enable their parallel
matching within multiple reduce tasks. (Section IV)

• The second load balancing approach, PairRange, adopts 
an enumeration scheme for all pairs of entities to evaluate. 
It redistributes the entities such that each reduce task has 
to compute about the same number of entity comparisons. 
(Section V) 

• We evaluate our strategies and thereby demonstrate the importance
of skew handling for MR-based ER. The evaluation
is done on a real cloud environment, uses real-world data, 
and compares the new approaches with each other and the 
basic MR strategy. (Section VI) 

In the next section we review the general MR program 
execution model. Related work is presented in Section VII 
before we conclude. Furthermore, we describe an extension 
of our strategies for matching two sources in Appendix I. 
Appendix II lists the pseudo-code for all proposed algorithms. 

II. MapReduce Program Execution 

MapReduce (MR) is a programming model designed for 
parallel data-intensive computing in cluster environments with 
up to thousands of nodes [5]. Data is represented by key-value
pairs and a computation is expressed with two user-defined
functions:

$map: (key_{in}, value_{in}) \rightarrow list(key_{tmp}, value_{tmp})$
$reduce: (key_{tmp}, list(value_{tmp})) \rightarrow list(key_{out}, value_{out})$

These functions contain sequential code and can be executed 
in parallel on disjoint partitions of the input data. The map 
function is called for each input key-value pair whereas reduce
is called for each key $key_{tmp}$ that occurs as map output.
Within the reduce function one can access the list of all
corresponding values $list(value_{tmp})$.

Besides map and reduce, a MR dataflow relies on three 
further functions. First, the function part partitions the map 
output and thereby distributes it to the available reduce tasks. 
All keys are then sorted with the help of a comparison function
comp. Finally, each reduce task employs a grouping function
group to determine the data chunks for each reduce function 
call. Note that each of these functions only operates on the 
key of key-value pairs and does not take the values into 
account. Keys can have an arbitrary structure and data type but 
need to be comparable. The use of extended (composite) keys 
and an appropriate choice of part, comp, and group supports 
sophisticated partitioning and grouping behavior and will be 
utilized in our load balancing approaches. 
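
The interplay of map, part, comp, group, and reduce can be sketched with a small in-memory simulation. This is a minimal illustration and not Hadoop's API: run_mr, the color-to-task table, and the input records are hypothetical, and the data only loosely follows the setting of composite keys made of a color and a shape, with partitioning on the color and grouping on the entire key.

```python
from itertools import groupby

def run_mr(records, map_fn, reduce_fn, r, part, comp_key, group_key):
    """Tiny in-memory stand-in for one MR job (illustration only)."""
    # Map phase: every input record may emit several (key, value) pairs.
    emitted = [kv for rec in records for kv in map_fn(rec)]
    # part: route each pair to one of r reduce tasks, looking at the key only.
    buckets = [[] for _ in range(r)]
    for key, value in emitted:
        buckets[part(key, r)].append((key, value))
    output = []
    for task_index, bucket in enumerate(buckets):
        # comp: sort the keys that arrived at this reduce task.
        bucket.sort(key=lambda kv: comp_key(kv[0]))
        # group: consecutive keys with equal group_key share one reduce call.
        for _, grp in groupby(bucket, key=lambda kv: group_key(kv[0])):
            grp = list(grp)
            first_key = grp[0][0]
            values = [v for _, v in grp]
            output.append((task_index, reduce_fn(first_key, values)))
    return output

# Hypothetical input: four records that emit ten composite keys in total.
records = [
    [("light", "circle"), ("dark", "circle"), ("black", "triangle")],
    [("light", "circle"), ("dark", "triangle")],
    [("dark", "circle"), ("black", "triangle"), ("light", "triangle")],
    [("dark", "triangle"), ("black", "triangle")],
]
# Fixed table instead of a hash function, to keep the run reproducible.
COLOR_TO_TASK = {"light": 0, "dark": 1, "black": 2}

result = run_mr(
    records,
    map_fn=lambda rec: [(key, 1) for key in rec],
    reduce_fn=lambda key, values: (key, sum(values)),
    r=3,
    part=lambda key, r: COLOR_TO_TASK[key[0]] % r,  # color only
    comp_key=lambda key: key,                       # sort on the entire key
    group_key=lambda key: key,                      # group on the entire key
)
for task_index, (key, count) in result:
    print(task_index, key, count)
```

Changing group_key to `lambda key: key[0]` would merge all shapes of one color into a single reduce call, which is the kind of control over grouping that composite keys make possible.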

For example, the center of Figure 1 shows an MR
program with two map tasks and three reduce tasks. The map
function is called for each of the four input key-value pairs
(denoted as ■) and the map phase emits a total of 10
key-value pairs using composite keys (Figure 1 only shows
keys for simplicity). Each composite key has a shape (circle or
triangle) and a color (light-gray, dark-gray, or black). Keys are
assigned to three reduce tasks using a partition function that
is only based on a part of the key ("color"). Finally, the group
function employs the entire key so that the reduce function is
called for 5 distinct keys.

Fig. 1. Schematic overview of example MR program execution using 1 map
process, m=2 map tasks, 2 reduce processes, and r=3 reduce tasks. In this
example, partitioning is based on the key's color only and grouping is done
on the entire key.

The actual execution of an MR program (also known as 
job) is realized by an MR framework implementation such as 
Hadoop [1]. An MR cluster consists of a set of nodes that 
run a fixed number of map and reduce processes. For each 
MR job execution, the number of map tasks (m) and reduce 
tasks (r) is specified. Note that the partition function part
relies on the number of reduce tasks since it assigns key-value
pairs to the available reduce tasks. Each process can execute
only one task at a time. After a task has finished, another
task is automatically assigned to the released process using a
framework-specific scheduling mechanism. The example MR
program of Figure 1 runs in a cluster with one map and two 
reduce processes, i.e., one map task and two reduce tasks can 
be processed simultaneously. Hence, the only map process 
runs two map tasks and the three reduce tasks are eventually 
assigned to two reduce processes. 

III. Load balancing for ER 

We describe our load balancing approaches for ER for one 
data source R. The input is a set of entities and the output 
is a match result, i.e., pairs of entities that are considered 
to be the same. With respect to blocking, we assume that 
all entities have a valid blocking key. The generalization to
consider entities without defined blocking key (e.g., missing
manufacturer information for products) is relatively easy. All
entities $R_\emptyset \subseteq R$ without blocking key need to be matched with
all entities, i.e., the Cartesian product $R \times R_\emptyset$ needs to be
determined, which is a special case of ER between two sources.
Appendix I explains how our strategies can be extended for
matching two sources.

Fig. 2. Schematic overview of the MR-based matching process with load
balancing.

As discussed in the introduction, parallel ER using blocking
can be easily implemented with MR. The map function can
be used to determine for every input entity its blocking key
and to output a key-value pair (blocking_key, entity). The
default partitioning strategy would use the blocking key to 
distribute key-value pairs among reduce tasks so that all 
entities sharing the same blocking key are assigned to the 
same reduce task. Finally, the reduce function is called for 
each block and computes the matching entity pairs within its 
block. We call this straightforward approach Basic. However, 
the Basic strategy is vulnerable to data skew due to blocks 
of largely varying size. Therefore the execution time may be 
dominated by a single or a few reduce tasks. Processing large 
blocks may also lead to serious memory problems because 
entity resolution requires that all entities within the same block 
are compared with each other. A reduce task must therefore 
store all entities passed to a reduce call in main memory - or 
must make use of external memory which further deteriorates 
execution times. 
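
The vulnerability of Basic to skew can be illustrated with a few lines of Python. The sketch below is based on assumptions: the block sizes are invented, and the partition function is a deterministic stand-in for a hash partitioner. It computes the number of comparisons each reduce task receives when entire blocks are routed by their blocking key.

```python
def comparisons(block_size):
    """Number of unordered entity pairs inside one block."""
    return block_size * (block_size - 1) // 2

def basic_reduce_loads(block_sizes, r, part):
    """Comparisons per reduce task when whole blocks are routed by part."""
    loads = [0] * r
    for key, size in block_sizes.items():
        loads[part(key, r)] += comparisons(size)
    return loads

# Hypothetical block sizes; one block holds 50 of the 100 entities.
block_sizes = {"a": 50, "b": 20, "c": 15, "d": 10, "e": 5}

# Deterministic stand-in for a hash partitioner (assumption for reproducibility).
def part(key, r):
    return (ord(key) - ord("a")) % r

loads = basic_reduce_loads(block_sizes, r=3, part=part)
print(loads, max(loads) / sum(loads))
```

With these numbers the reduce task that receives the largest block performs about 80% of all comparisons, so adding further reduce tasks would not shorten the job.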

A domain expert might, of course, adjust the blocking 
function so that it returns blocks of similar sizes. However, this 
tuning is very difficult because it must ensure that matching 
entities still reside in the same block. Furthermore, the block- 
ing function needs to be adjusted for every match problem 
individually. We therefore propose two general load balancing 
approaches that address the mentioned skew and memory 
problems by distributing the processing of large blocks among 
several reduce tasks. Both approaches are based on a general 
ER workflow with two MR jobs that is described next. The 
first MR job, described in Section III-B, analyzes the input 
data and is the same for both load balancing schemes. The 
different load balancing strategies BlockSplit and PairRange 
are described in the following sections IV and V, respectively. 

A. General ER Workflow for Load Balancing 

To realize our load balancing strategies, we perform ER
processing within two MR jobs as illustrated in Figure 2. Both
jobs are based on the same number of map tasks and the
same partitioning of the input data.^1 The first job calculates
a so-called block distribution matrix (BDM) that specifies the
number of entities per block separated by input partitions. The
matrix is used by the load balancing strategies (in the second
MR job) to tailor entity redistribution for parallel matching of
blocks of different size.

^1 See Appendix II for details.

Fig. 3. The example data consists of 14 entities A-O that are divided into
two partitions $\Pi_0$ and $\Pi_1$.

Load balancing is mainly realized within the map phase 
of the second MR job. Both strategies follow the idea that 
map generates a carefully constructed composite key that 
(together with associated partition and group functions) allows 
a balanced load distribution. The composite key thereby com- 
bines information about the target reduce task(s), the block 
of the entity, and the entity itself. While the MR partitioning 
may only use part of the map output key for routing, it still 
groups together key-value pairs with the same blocking key 
component of the composite key and, thus, makes sure that 
only entities of the same block are compared within the reduce 
phase. As we will see, the map function may generate multiple 
keys per entity if this entity is supposed to be processed by 
multiple reduce tasks for load balancing. Finally, the reduce 
phase performs the actual ER and computes match similarities 
between entities of the same block. Since the reduce phase 
consumes the vast majority of the overall runtime (more 
than 95% in our experiments), our load balancing strategies 
solely focus on data redistribution for reduce tasks. Other 
MR-specific performance factors are therefore not considered. 
For example, consideration of data locality (see, e.g., [10]) 
would have only limited impact and would require additional 
modification of the MR framework. 

B. Block Distribution Matrix 

The block distribution matrix (BDM) is a b × m matrix that
specifies the number of entities of b blocks across m input partitions.
The BDM computation using MR is straightforward.
The map function determines the blocking key for each entity
and outputs a key-value pair with a composite map output key
(blocking_key.partition_index) and a corresponding value of
1 for each entity.^2 The key-value pairs are partitioned based
on the blocking key component to ensure that all data for
a specific block is processed in the same reduce task. The
reduce task's key-value pairs are sorted and grouped by the
entire key and reduce counts the number of blocking keys 
(i.e., entities per block) per partition and outputs triples of the 
form (blocking key, partition index, number of entities). 

^2 A combine function that aggregates the frequencies of the blocking keys
per map task might be employed as an optimization.

[Figure 4 panels: "Map: Blocking" (one key-value pair with value 1 per entity, keyed by blocking key and partition index); "Reduce: Count per Block + Partition, Group by (BlockKey.Partition)" with outputs [w, 0, 2], [w, 1, 2], [y, 1, 2], [x, 0, 3], [z, 0, 2], [z, 1, 3]; "Block Distribution Matrix (BDM)".]

Fig. 4. Example dataflow for computation of the block distribution matrix
(MR Job1 of Figure 2) using the example data of Figure 3.

For illustration purposes, we use a running example with 14
entities and 4 blocking keys as shown in Figure 3. Figure 4
illustrates the computation of the BDM for this example data.
For example, the map output key of M is z.1 because M's blocking
key equals z and M appears in the second partition (partition
index=1). This key is assigned to the last reduce task that
outputs [z, 1, 3] because there are 3 entities in the second
partition for blocking key z. The combined reduce outputs 
correspond to a row-wise enumeration of non-zero matrix 
cells. To assign block keys to rows of the BDM, we use the 
(arbitrary) order of the blocks from the reduce output, i.e., we 
assign the first block (key w) to block index position 0, etc. 
The block sizes in the example vary between 2 and 5 entities. 
The match work to compare all entities per block with each 
other thus ranges from 1 to 10 pair comparisons; the largest 
block with key z entails 50% of all comparisons although it 
contains only 35% (5 of 14) of all entities. 
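
The BDM computation for the running example can be retraced with the following Python sketch. The assignment of entities to blocking keys is reconstructed from the counts given for the example and should be read as an assumption, as should the helper names bdm_map and bdm_reduce. The row order w, y, x, z follows the order of the reduce output described above.

```python
from collections import Counter

# Blocking key per entity for the two input partitions (reconstructed,
# hence an assumption): 14 entities, 4 blocking keys.
partitions = [
    {"A": "w", "B": "w", "C": "x", "D": "x", "E": "x", "F": "z", "G": "z"},
    {"H": "w", "I": "w", "K": "y", "L": "y", "M": "z", "N": "z", "O": "z"},
]

def bdm_map(partition_index, blocking_keys):
    """Emit ((blocking_key, partition_index), 1) for every entity."""
    for key in blocking_keys:
        yield (key, partition_index), 1

def bdm_reduce(pairs):
    """Count the entities per (blocking_key, partition_index)."""
    counts = Counter()
    for composite_key, one in pairs:
        counts[composite_key] += one
    return counts

emitted = [kv for index, partition in enumerate(partitions)
           for kv in bdm_map(index, partition.values())]
counts = bdm_reduce(emitted)

# Dense b x m matrix; block index 0 is w, followed by y, x, and z.
block_order = ["w", "y", "x", "z"]
bdm = [[counts.get((key, index), 0) for index in range(len(partitions))]
       for key in block_order]

block_sizes = [sum(row) for row in bdm]
pairs_per_block = [n * (n - 1) // 2 for n in block_sizes]
print(bdm, pairs_per_block)
```

The last block (key z) accounts for 10 of the 20 comparisons although it holds only 5 of the 14 entities.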

As illustrated in Figure 2, map produces an additional output 
n • per partition that contains the original entities annotated 
with their blocking keys. This output is not shown in Figure 4 
to save space but used as input in the second MR job (see 
Figures 5 and 7). 

IV. Block-based Load Balancing 

The first strategy, BlockSplit, generates one or several so- 
called match tasks per block and distributes match tasks among 
reduce tasks. Furthermore, it uses the following two ideas: 

• BlockSplit processes small blocks within a single match
task, similar to the basic MR implementation. Large blocks
are split according to the m input partitions into m sub- 
blocks. The resulting sub-blocks are then processed using 
match tasks of two types. Each sub-block is (like any unsplit 
block) processed by a single match task. Furthermore, pairs 
of sub-blocks are processed by match tasks that evaluate the 
Cartesian product of two sub-blocks. This ensures that all 
comparisons of the original block will be computed in the 
reduce phase. 

• BlockSplit determines the number of comparisons per
match task and assigns match tasks in descending size
among reduce tasks. This implements a greedy load balancing
heuristic ensuring that the largest match tasks are
processed first to make it unlikely that they dominate or
increase the overall execution time.

The realization of BlockSplit makes use of the BDM as well
as of composite map output keys. The map phase outputs key-value
pairs with key=(reduce_index.block_index.split) and
value=(entity). The reduce task index is a value between 0 and
r - 1 and is used by the partition function to realize the desired
assignment to reduce tasks. The grouping is done on the entire
key and, since the block index is part of the key, ensures
that each reduce function only receives entities of the same
block. The split value indicates what match task has to be
performed by the reduce function, i.e., whether a complete
block or sub-blocks need to be processed. In the following,
we describe map key generation in detail.

During the initialization, each of the m map tasks reads
the BDM and computes the number of comparisons per block
and the total number of comparisons P over all b blocks $\Phi_k$:

$$P = \frac{1}{2} \cdot \sum_{k=0}^{b-1} |\Phi_k| \cdot (|\Phi_k| - 1)$$

For each block $\Phi_k$ it also checks
if the number of comparisons is above the average reduce task
workload, i.e., if

$$\frac{1}{2} \cdot |\Phi_k| \cdot (|\Phi_k| - 1) > P/r$$

If the block $\Phi_k$ is not above the average workload it can be
processed within a single match task (this is denoted as k.* in
the block_index and split components of the map output key).
Otherwise it is split into m sub-blocks based on the m input
partitions^3, leading to the following $\frac{1}{2} \cdot m \cdot (m - 1) + m$ match
tasks:

• m match tasks, denoted with key components k.i, for the
individual processing of the i-th sub-block for $i \in [0, m - 1]$

• $\frac{1}{2} \cdot m \cdot (m - 1)$ match tasks, denoted with key components
k.i×j with $i, j \in [0, m - 1]$ and i < j, for the computation
of the Cartesian product of sub-blocks i and j

To determine the reduce task for each match task, all match 
tasks are first sorted in descending order of their number of 
comparisons. Match tasks are then assigned to reduce tasks 
in this order so that the current match task is assigned to the 
reduce task with the lowest number of already assigned pairs. 
In the following, we denote the reduce task index for match 
task k.x with R(k.x). 
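
The match task generation and the greedy assignment can be sketched in Python as follows. This is a minimal sketch and not the pseudo-code of Appendix II: the function name blocksplit_assignment is hypothetical, the BDM is the one of the running example with rows in the order w, y, x, z, and ties are broken by the lowest reduce task index, which is an assumption. Unlike the described implementation, the sketch does not skip empty sub-blocks; they simply yield match tasks with zero comparisons.

```python
def blocksplit_assignment(bdm, r):
    """Return (match_tasks, assignment, loads) for a b x m BDM and r reduce tasks."""
    m = len(bdm[0])
    sizes = [sum(row) for row in bdm]
    total = sum(n * (n - 1) // 2 for n in sizes)          # P
    tasks = []                                            # (name, comparisons)
    for k, row in enumerate(bdm):
        block_cmp = sizes[k] * (sizes[k] - 1) // 2
        if block_cmp * r <= total:                        # not above P / r
            tasks.append((f"{k}.*", block_cmp))
            continue
        for i in range(m):
            # Sub-block k.i is matched with itself ...
            tasks.append((f"{k}.{i}", row[i] * (row[i] - 1) // 2))
            # ... and with every later sub-block j (Cartesian product k.ixj).
            for j in range(i + 1, m):
                tasks.append((f"{k}.{i}x{j}", row[i] * row[j]))
    # Greedy heuristic: largest match task first (stable sort keeps ties in
    # creation order), always to the currently least loaded reduce task.
    tasks.sort(key=lambda task: -task[1])
    loads, assignment = [0] * r, {}
    for name, cmp_count in tasks:
        target = loads.index(min(loads))
        assignment[name] = target
        loads[target] += cmp_count
    return tasks, assignment, loads

bdm = [[2, 2], [0, 2], [3, 0], [2, 3]]   # rows: w, y, x, z
tasks, assignment, loads = blocksplit_assignment(bdm, r=3)
print([name for name, _ in tasks], loads)
```

For three reduce tasks, only the block with index 3 exceeds the average workload of 20/3 comparisons and is split; every reduce task ends up with six or seven comparisons.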

After the described initialization phase, the map function is
called for each input entity. If the entity belongs to a block
that does not have to be split, map outputs one key-value pair
with composite key=R(k.*).k.*. Otherwise, map outputs m
key-value pairs for the entity. The key R(k.i).k.i represents the
individual sub-block i of block $\Phi_k$ and the remaining m - 1
pairs represent all combinations with the other m - 1 sub-blocks.
This indicates that entities of split blocks are replicated
m times to support load balancing. The map function emits
the entity as value of the key-value pair; for split blocks we
annotate entities with the partition index for use in the reduce
phase.

^3 Note that the BDM holds the number of entities per (block, partition) pair
and map can therefore determine which input partitions contain entities of
$\Phi_k$. However, in favor of readability we assume that all m input partitions
contain at least one entity. Our implementation, of course, ignores unnecessary
partitions.

Fig. 5. Example dataflow for the load balancing strategy BlockSplit.

In our running example, only block $\Phi_3$ (blocking key z)
is subject to splitting into m=2 sub-blocks. The BDM (see
Figure 4) indicates for block $\Phi_3$ that $\Pi_0$ and $\Pi_1$ contain 2
and 3 entities, respectively. The resulting sub-blocks $\Phi_{3.0}$ and
$\Phi_{3.1}$ lead to the three match tasks 3.0, 3.0×1, and 3.1 that
account for 1, 6, and 3 comparisons, respectively. The resulting
ordering of match tasks by size (0.*, 3.0×1, 2.*, 3.1, 1.*, and
3.0) leads for three reduce tasks to the distribution shown in
Figure 5. The replication of the five entities for the split block
leads to 19 key-value pairs for the 14 input entities. Each
reduce task has to process between six and seven comparisons
indicating a good load balancing for the example.

V. Pair-based Load Balancing 

The block-based strategy BlockSplit splits large blocks according
to the input partitions. This approach may still lead to
unbalanced reduce task workloads due to differently sized sub-blocks.
We therefore propose a more sophisticated pair-based
load balancing strategy, PairRange, that aims at a uniform
number of pairs for all reduce tasks. It uses the following two
ideas:

• PairRange implements a virtual enumeration of all entities
and relevant comparisons (pairs) based on the BDM. The
enumeration scheme is used to send entities to one or more
reduce tasks and to define the pairs that are processed by
each reduce task.

• For load balancing, PairRange splits the range of all
relevant pairs into r (almost) equally sized pair ranges and
assigns the k-th range to the k-th reduce task.
Fig. 6. Global enumeration of all pairs for the running example. The three
different shades indicate how the PairRange strategy assigns pairs to 3 
reduce tasks. 



Each map task processes its input partition row-by-row 
and can therefore enumerate entities per partition and block. 
Although entities are processed independently in different 
partitions, the BDM allows computing the global block-specific
entity index locally within the map phase. Given a
partition $\Pi_i$ and a block $\Phi_k$, the overall number of entities of
$\Phi_k$ in all preceding partitions $\Pi_0$ through $\Pi_{i-1}$ has just to be
added as offset. For example, entity M is the first entity of
block $\Phi_3$ in partition $\Pi_1$. Since the BDM indicates that there
are two other entities in $\Phi_3$ in the preceding partition $\Pi_0$, M
is the third entity of $\Phi_3$ and is thus assigned entity index 2.
Figure 6 shows block-wise the resulting index values for all 
entities of the running example (white numbers). 
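
The offset computation can be written down in a few lines of Python. The function name entity_index and the parameter local_position are hypothetical; the BDM is the one of the running example with rows in the assumed block order w, y, x, z.

```python
def entity_index(bdm, block, partition, local_position):
    """Global index of an entity inside its block.

    local_position counts the entities of the same block that the map task
    has already seen in its own partition (0 for the first one).
    """
    offset = sum(bdm[block][:partition])   # entities in preceding partitions
    return offset + local_position

bdm = [[2, 2], [0, 2], [3, 0], [2, 3]]     # rows: w, y, x, z

# M is the first entity of block 3 (key z) in the second partition.
m_index = entity_index(bdm, block=3, partition=1, local_position=0)
print(m_index)
```

Each map task only needs the BDM and a per-block counter for its own partition, so no communication between map tasks is required.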

Enumeration of entities allows for an effective enumeration
of all pairs to compare. An entity pair (x, y) with entity
indexes x and y is only enumerated if x < y. We thereby avoid
unnecessary computation, i.e., pairs of the same entity (x, x)
are not considered as well as pairs (y, x) if (x, y) has already
been considered. Pair enumeration employs a column-wise
continuous enumeration across all blocks based on information
of the BDM. The pair index $p_i(x, y)$ of two entities with
indexes x and y (x < y) in block $\Phi_i$ is defined as follows:

$$p_i(x, y) = c(x, y, |\Phi_i|) + o(i) \qquad (1)$$

with $c(x, y, N) = \frac{x}{2} \cdot (2 \cdot N - x - 3) + y - 1$
and $o(i) = \frac{1}{2} \cdot \sum_{k=0}^{i-1} |\Phi_k| \cdot (|\Phi_k| - 1)$.
Here c(x, y, N) is the index of the
cell (x, y) in an N × N matrix and o(i) is the offset and
equals the overall number of pairs in all preceding blocks $\Phi_0$
through $\Phi_{i-1}$. The number of entities in block $\Phi_i$ is denoted as
$|\Phi_i|$. Figure 6 illustrates the pair enumeration for the running
example. The pair index $p_i(x, y)$ of pair (x, y) can be found in the
column x and row y of block $\Phi_i$. For example, the index for
pair (2, 3) of block $\Phi_0$ equals 5.
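
Formula (1) can be checked against the running example with the Python sketch below. The function names are hypothetical and the block sizes (4, 2, 3, 5 for block indexes 0 to 3) are taken from the example BDM. The helper range_index additionally assumes that a pair index p is mapped to the range ⌊r · p / P⌋; this is stated as an assumption, although it does reproduce the ranges [0, 6], [7, 13], and [14, 19] given for r = 3 in the example.

```python
def cell_index(x, y, n):
    """c(x, y, N): column-wise index of cell (x, y), x < y, in an N x N matrix."""
    assert 0 <= x < y < n
    # x * (2n - x - 3) is always even, so integer division is exact.
    return x * (2 * n - x - 3) // 2 + y - 1

def block_offset(block_sizes, i):
    """o(i): number of pairs in all blocks preceding block i."""
    return sum(n * (n - 1) // 2 for n in block_sizes[:i])

def pair_index(block_sizes, i, x, y):
    """p_i(x, y) according to formula (1)."""
    return cell_index(x, y, block_sizes[i]) + block_offset(block_sizes, i)

def range_index(p, total_pairs, r):
    """Range (reduce task) of pair index p; floor(r * p / P) is an assumption."""
    return (r * p) // total_pairs

block_sizes = [4, 2, 3, 5]
total_pairs = block_offset(block_sizes, len(block_sizes))   # P = 20

# Entity M has index 2 in block 3 and takes part in these pairs.
m_pairs = [pair_index(block_sizes, 3, x, y)
           for x, y in [(0, 2), (1, 2), (2, 3), (2, 4)]]
m_ranges = sorted({range_index(p, total_pairs, 3) for p in m_pairs})
print(pair_index(block_sizes, 0, 2, 3), m_pairs, m_ranges)
```

Because the range index grows monotonically with the pair index, it would be enough to evaluate the smallest and the largest pair index of an entity instead of all of its pairs.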

PairRange splits the range of all pairs into r almost equally-
sized pair ranges and assigns the k-th range $\mathcal{R}_k$ to the k-th reduce
task; k is therefore both the reduce task index and the range
index. Given a total of P pairs and r ranges, a pair with index
$0 \le p < P$ falls in $\mathcal{R}_k$ if

$$p \in \mathcal{R}_k \Leftrightarrow k = \left\lfloor \frac{r \cdot p}{P} \right\rfloor \qquad (2)$$

The first r - 1 reduce tasks process $\lceil P/r \rceil$ pairs each
whereas the last reduce task is responsible for the remaining
$P - (r - 1) \cdot \lceil P/r \rceil$ pairs. In the example of Figure 6, we have
20 pairs, so that for r = 3 we obtain the ranges
$\mathcal{R}_0 = [0, 6]$, $\mathcal{R}_1 = [7, 13]$, and $\mathcal{R}_2 = [14, 19]$ (illustrated by
different shades).

During the initialization, each of the m map tasks reads
the BDM, computes the total number of comparisons P, and
determines the r pair ranges. Afterwards, the map function
is called for each entity e and determines e's entity index
x as well as all relevant ranges, i.e., all ranges that contain
at least one pair where e is participating. The identification
of relevant ranges does not require the examination of the
possibly large number of all pairs but can be mostly realized
by processing two pairs. Let N be the size of e's block;
entity e with index x takes part in the pairs (0, x), ..., (x - 1, x),
(x, x + 1), ..., (x, N - 1). The enumeration scheme thus
allows for a quick identification of $p_{min}$ and $p_{max}$, i.e., e's
pairs with the smallest and highest pair index. For example,
M has an entity index of 2 within a block of size $|\Phi_3| = 5$
and the two pairs are therefore $p_{min} = p_3(0, 2) = 11$ and
$p_{max} = p_3(2, 4) = 18$. All relevant ranges of e are between
$\mathcal{R}_{min} \ni p_{min}$ and $\mathcal{R}_{max} \ni p_{max}$ because the range index is
monotonically increasing with the pair index (see formula (2)).
Entity M is thus only needed for the second and third pair
range (reduce task).

Finally, map emits a key-value pair with
key=(range_index.block_index.entity_index) and value=entity for each relevant
range. The MR partitioning is based on the range index only
for routing all data of range $\mathcal{R}_k$ to the reduce task with index
k. The sorting is done based on the entire key whereas the
grouping is done by range index and block index. The reduce
task does not necessarily receive all entities of a block but
only those entities that are relevant for the reduce task's pair
range. The reduce function generates all pairs (x, y) with
entity indexes x < y, checks if the pair index falls into the
reduce task's pair range, and, if this is the case, computes the
matching for this pair. To this end, map additionally annotates
each entity with its entity index so that the pair index can be
easily computed by the reduce function.

Figure 7 illustrates the PairRange strategy for our running
example. Entity M belongs to block $\Phi_3$, has an entity index
of 2, and takes part in 4 pairs with pair indexes 11, 14, 17,
and 18, respectively. Given the three ranges [0, 6], [7, 13], and
[14, 19], entity M has to be sent to the second reduce task
(index=1) for pair #11 and the third reduce task (index=2) for
the other pairs; map therefore outputs two tuples (1.3.2, M)
and (2.3.2, M). The second reduce task not only receives M
but all entities of $\Phi_3$ (F, G, M, N, and O). However, due
to its assigned pair range [7, 13], it only processes pairs with
indexes 10 through 13 of $\Phi_3$ (and, of course, 7 through 9 of
$\Phi_2$). The remaining pairs of $\Phi_3$ are processed by the third
reduce task which receives all entities of $\Phi_3$ but F because
the latter does not take part in any of the pairs with index 14
through 19 (see Figure 6).

Fig. 7. Example dataflow for the load balancing strategy PairRange.

VI. Evaluation


In the following we evaluate our BlockSplit and PairRange
strategies regarding three performance-critical factors: the degree
of data skew (Section VI-A), the number of configured
map (m) and reduce (r) tasks (Section VI-B), and the number
of available nodes (n) in the cloud environment (Section VI-C).
In each experiment we examine a reasonable range of
values for one of the three factors while holding constant the 
other two factors. We thereby broadly evaluate our algorithms 
and investigate to what degree they are robust against data 
skew, can benefit from many reduce tasks, and can scale with 
the number of nodes. 

We ran our experiments on the Amazon EC2 cloud infrastructure
using Hadoop with up to 100 High-CPU Medium
instances, each providing 2 virtual cores. Each node was
configured to run at most two map and reduce tasks in parallel. 
On each node we set up Hadoop 0.20.2 and made the same 
changes to the Hadoop default configuration as in [19]. 

We utilized two real-world datasets (see Figure 8). The
first dataset, DS1, contains about 114,000 product descriptions.
The second dataset, DS2^4, is an order of magnitude larger
and contains about 1.4 million publication records. For both 
datasets, the first three letters of the product or publication title, 
respectively, form the default blocking key (in the robustness 
experiment, we vary the blocking to study skew effects). The 
resulting number of blocks as well as the relative size of the 
respective largest block are given in Figure 8. Note that the 
blocking attributes were not chosen to artificially generate data 
skew but rather reflect a reasonable way to group together 
similar entities. Two entities were compared by computing 
the edit distance of their title. Two entities with a minimal 
similarity of 0.8 were regarded as matches. 
Fig. 9. Execution times for different data skews.



A. Robustness: Degree of data skew 

We first evaluate the robustness of our load balancing strate- 
gies against data skew. To this end, we control the degree of 
data skew by modifying the blocking function and generating 
block distributions that follow an exponential distribution. 
Given a fixed number of blocks b=100, the number of entities
in the k-th block is proportional to $e^{-s \cdot k}$. The skew factor
$s \ge 0$ thereby describes the degree of data skew. Note that
the data skew, i.e., the distribution of entities over all blocks,
determines the overall number of entity pairs. For example,
two blocks with 25 entities each lead to 2 · 25 · 24/2 = 600
pairs. If the 50 entities are split 45 vs. 5 the number of pairs
equals already 45 · 44/2 + 5 · 4/2 = 1,000. We are therefore
interested in the average execution time per entity pair when 
comparing load balancing strategies for different data skews. 
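
The pair arithmetic above can be reproduced with a short sketch. The function names and the rounding policy (leftover entities go to the first, largest block) are assumptions made for illustration; the text only states that block sizes are proportional to e^(-s·k).

```python
import math


def pairs_in_block(size: int) -> int:
    """Number of entity pairs inside one block: size * (size - 1) / 2."""
    return size * (size - 1) // 2


def total_pairs(block_sizes) -> int:
    return sum(pairs_in_block(n) for n in block_sizes)


def skewed_block_sizes(num_entities: int, b: int, s: float) -> list:
    """Distribute entities over b blocks with sizes proportional to exp(-s*k)."""
    weights = [math.exp(-s * k) for k in range(b)]
    total = sum(weights)
    sizes = [int(num_entities * w / total) for w in weights]
    # Assumption: entities lost to rounding are added to the largest block.
    sizes[0] += num_entities - sum(sizes)
    return sizes
```

For the same number of entities, a larger skew factor concentrates entities in few blocks and therefore increases the overall number of pairs, which is why execution times are compared per pair.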

Figure 9 shows the average execution time per $10^4$ pairs for
different data skews of DS1 (n = 10, m = 20, r = 100). The
Basic strategy explained in Section III is not robust against
data skew because a higher data skew increases the number
of pairs of the largest block. For example, for s = 1 Basic
needs 225 ms per $10^4$ comparisons, which is more than 12
times slower than BlockSplit and PairRange. However, the
Basic strategy is the fastest for a uniform block distribution
(s = 0) because it does not suffer from the additional BDM
computation and load balancing overhead. The BDM influence
becomes insignificant for higher data skews because the data
skew does not affect the time for BDM computation but
only the number of pairs. This is why the execution time per
pair decreases with increasing s. In general, both BlockSplit
and PairRange are stable across all data skews with a small
advantage for PairRange due to its somewhat more uniform
workload distribution.

Footnote (source of DS2): http://asterix.ics.uci.edu/data/csx.raw.txt.gz

B. Number of reduce tasks 

In our next experiment, we study the influence of the
number r of reduce tasks in a fixed cloud environment of 10
nodes. We vary r from 20 to 160 but keep the number of map
tasks constant (m = 20). The resulting execution times for
DS1 are shown in Figure 10. Execution times for PairRange
and BlockSplit include the relatively small overhead (35 s) for
BDM computation.

We observe that both BlockSplit and PairRange significantly 
outperform the Basic strategy. For example, for r=160 they 
improve execution times by a factor of 6 compared to Basic. 

The Basic approach fails to efficiently leverage
many reduce tasks because it cannot distribute the
matching of large blocks to multiple reduce tasks.
Consequently, the time required to process the largest block (which
accounts for more than 70% of all pairs, see Figure 8) forms
a lower bound of the overall execution time. Since the
partitioning is done without consideration of the block size,
an increasing number of reduce tasks may even increase the
execution time if two or more large blocks are assigned to the
same reduce task, as can be seen by the peaks in Figure 10.
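
The lower bound and the peaks can be illustrated with a small sketch. It assumes that `k % r` stands in for Hadoop's hash partitioning of the blocking key (any assignment that ignores block sizes behaves similarly); the function name and the block sizes used below are hypothetical.

```python
def basic_reduce_loads(block_sizes, r):
    """Pairs per reduce task when every block is processed by a single task.

    Assumption: block k is sent to reduce task k % r, a stand-in for a
    hash partitioner that does not look at block sizes.
    """
    loads = [0] * r
    for k, size in enumerate(block_sizes):
        loads[k % r] += size * (size - 1) // 2
    return loads


sizes = [50, 5, 5, 40]                       # hypothetical block sizes
largest = max(n * (n - 1) // 2 for n in sizes)
for r in (2, 3, 4):
    loads = basic_reduce_loads(sizes, r)
    # The slowest task never handles fewer pairs than the largest block.
    assert max(loads) >= largest
```

In this example the maximum load grows from 1,235 pairs with r = 2 to 2,005 pairs with r = 3 because the two largest blocks end up on the same reduce task, which mirrors the peaks discussed above.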

On the other hand, both BlockSplit and PairRange take 
advantage of an increasing number of reduce tasks. BlockSplit 
provides relatively stable execution times over the entire range 
of reduce tasks underlining its load balancing effectiveness. 
PairRange gains more from a larger number of reduce tasks 
and eventually outperforms BlockSplit by 7%. 

However, even though PairRange always generates a uniform
workload for all reduce tasks, it may be slower than
BlockSplit for small r. This is due to the fact that the execution
time is also influenced by other effects. Firstly, the execution
time of a reduce task may differ due to heterogeneous hardware
and matching attribute values of different length. This
computational skew diminishes for larger r values because of
a smaller number of pairs per reduce task. Secondly, slightly
unbalanced reduce task workloads can be counterbalanced by
a favorable mapping to processes.

As we have shown in Section VI-A, neither strategy is
vulnerable to data skew, but BlockSplit's load balancing strategy
depends on the input (map) partitioning. To study this dependency,
we sorted DS1 by title; Figure 11 compares the execution times
for the unsorted (i.e., arbitrary order) and sorted dataset.
Since the blocking key is the first three letters of the title, a
sorted input dataset is likely to group together large blocks into
the same map partition. This limits BlockSplit's ability to split
large blocks and increases its execution time by 80%. This
effect can be diminished by a higher number of map tasks.

Figure 12 shows the number of emitted key-value pairs
during the map phase for all strategies. The map output for
Basic always equals the number of input entities because Basic
does not send an entity to more than one task and, thus, does
not replicate any input data. The BlockSplit strategy shows a
step-function-like behavior because the number of reduce tasks
determines which blocks will be split but does not influence the
split method itself, which is solely based on the input partitions.
As a consequence, BlockSplit generates the largest map output
for a small number of reduce tasks. However, an increasing
number of reduce tasks increases the map output only to a
limited extent because large blocks that have already been
split are not affected by additional reduce tasks. In contrast,
the PairRange strategy is independent of the blocks and only
considers pair ranges. Even though the number of relevant
entities per pair range may vary (see, e.g., Figure 7), the
overall number of emitted key-value pairs increases almost
linearly with an increasing number of ranges (reduce tasks). For a
large number of reduce tasks PairRange therefore produces the
largest map output. The associated overhead (additional data
transfer, sorting larger partitions) did not significantly impact
the execution times up to a moderate size of the utilized cloud
infrastructure because the matching in the reduce
phase is by far the dominant factor of the overall runtime. We
will investigate the scalability of our strategies for large cloud
infrastructures in our last experiment.

C. Scalability: Number of nodes 

Scalability in the cloud is not only important for fast
computation but also for financial reasons. The number of
nodes should be carefully selected because cloud infrastructure
vendors usually charge per employed machine even if it is
underutilized. To analyze the scalability of Basic, BlockSplit,
and PairRange, we vary the number of nodes from 1 up to 100.
For n nodes, the number of map tasks is set to m = 2 · n and
the number of reduce tasks is set to r = 10 · n, i.e., adding new
nodes leads to additional map and reduce tasks. The resulting
execution times and speedup values are shown in Figure 13
(DS1) and Figure 14 (DS2).

As expected, Basic does not scale for more than two nodes 
due to the limitation that all entities of a block are compared 
within a single reduce task. The execution time is therefore 
dominated by the reduce task that has to process the largest 
block and, thus, about 70% of all pairs. An increasing number 
of nodes only slightly decreases the execution time because 
the increasing number of reduce tasks reduces the additional 
workload of the reduce task that handles the largest block. 

By contrast, both BlockSplit and PairRange show their
ability to evenly distribute the workload across reduce tasks
and nodes. They scale almost linearly up to 10 nodes for the
smaller dataset DS1 and up to 40 nodes for the larger dataset
DS2, respectively. For large n we observe significantly better
speedup values for DS2 than for DS1 due to the reasonable
workload per reduce task that is crucial for efficient utilization
of available cores. BlockSplit outperforms PairRange for DS1
and n = 100 nodes. The resulting large number of reduce tasks
leads, in conjunction with the comparatively small data size,
to a comparatively small average number of comparisons
per reduce task. Therefore PairRange's additional overhead
(see Figure 12) deteriorates the overall execution time. This
overhead becomes insignificant for the larger dataset DS2.
The average number of comparisons is more than 2,000 times
higher than for DS1 (see Figure 8) and, thus, the benefit
of optimally balanced reduce tasks outweighs the additional
overhead of handling more key-value pairs. In general,
BlockSplit is preferable for smaller (splittable) datasets under the
assumption that the dataset's data order does not depend
on the blocking key; otherwise PairRange performs
better.

VII. Related Work 

Load balancing and skew handling are well-known data 
management problems and MR has been criticized for having 
overlooked the skew issue [7]. Parallel database systems al- 
ready implement skew handling mechanisms, e.g., for parallel 
hash join processing [6] that share many similarities with our 
problem. 

A theoretical analysis of skew effects for MR is given 
in [15] but focuses on linear processing of entities in the reduce 
phase. It disregards the N-squared complexity of comparing all 
entities with each other. [14] reports that the reduce runtime 
for scientific tasks does not only depend on the assigned work- 
load (e.g., number of pairs) but also on the data to process. 
The authors propose a framework for automatic extraction of 
signatures for (spatial) data to reduce the computational skew. 
This approach is orthogonal to ours: it addresses computational 
skew and does not consider effective handling of present data 
skew. 

A fairness-aware key partitioning approach for MR that 
targets locality-aware scheduling of reduce tasks is proposed 
in [10]. The key idea is to assign map output to reduce tasks 
that eventually run on nodes that already hold a major part of 
the corresponding data. This is achieved by a modification of 
the MR framework implementation to control the scheduling 
of reduce tasks. Similar to our BDM, this approach determines 
the key distribution to optimize the partitioning. However, it 
does not split large blocks but still processes all data sharing 
the same key at the same reduce task which may lead to 
unbalanced reduce workloads. 



MR has already been employed for ER (e.g., [20]) but we
are only aware of one load balancing mechanism for MR-based
ER. [11] studies load balancing for Sorted Neighborhood
(SN). However, SN follows a different blocking approach
that is by design less vulnerable to skewed data.

MR's inherent vulnerability to data skew is relevant for all
kinds of pairwise similarity computation. Example applications
include pairwise document similarity [9] to identify similar
documents, set-similarity joins [19] for efficient string
similarity computation in databases, pairwise distance
computation [18] for clustering complex objects, and all-pairs matrix
computation [16] for scientific computing. All approaches
follow an idea similar to ER using blocking: One or more
signatures (e.g., tokens or terms) are generated per object (e.g.,
document) to avoid the computation of the Cartesian product.
MR groups together objects sharing (at least) one signature
and performs similarity computation within the reduce phase.
Simple approaches like [9] create many signatures per object,
which leads to unnecessary computation because similar
objects are likely to have more than one signature in common
and are thus compared multiple times. Advanced approaches
such as [19] reduce unnecessary computation by employing
filters (e.g., based on token frequencies) that still guarantee
that similar object pairs share at least one signature.

A more general case is the computation of theta-joins
with MapReduce [17]. Static load balancing mechanisms are
not suitable due to arbitrary join conditions. Similar to our
approach, [17] employs a pre-analysis phase to determine the
datasets' characteristics (using sampling) and thereby avoids
the evaluation of the Cartesian product. This approach is more
coarse-grained when compared to our strategies.

Load balancing is only one aspect of an optimal
execution of MR programs. For example, Manimal [4] employs
static code analysis to optimize MR programs. Hadoop++ [8]
proposes index and join techniques that are realized using
appropriate partitioning and grouping functions, amongst others.

VIII. Summary and outlook 

We proposed two load balancing approaches, BlockSplit and
PairRange, for parallelizing blocking-based entity resolution
using the widely available MapReduce framework. Both
approaches are capable of dealing with skewed data (blocking key)
distributions and effectively distribute the workload among
all reduce tasks by splitting large blocks. Our evaluation in
a real cloud environment using real-world data demonstrated
that both approaches are robust against data skew and scale
with the number of available nodes. The BlockSplit approach
is conceptually simpler than PairRange but already achieves
excellent results. PairRange is less dependent on the initial
partitioning of the input data and slightly more scalable for
large match tasks.

In future work, we will extend our approaches to multi- 
pass blocking that assigns multiple blocks per entity. We will 
further investigate how our load balancing approaches can be 
adapted for MapReduce-based implementations of other data- 
intensive tasks, such as join processing or data mining. 
Appendix I
Matching two sources

This section describes the extension of the BlockSplit and
PairRange strategies for matching two sources R and S. We
thereby assume that all entities have a valid blocking key.
Consideration of entities without valid blocking keys can be
accomplished as follows:

$$\mathrm{match}_B(R, S) = \mathrm{match}_B(R - R_\emptyset,\, S - S_\emptyset) \;\cup\; \mathrm{match}_\perp(R,\, S_\emptyset) \;\cup\; \mathrm{match}_\perp(R_\emptyset,\, S - S_\emptyset)$$

Given two sources R and S with subsets $R_\emptyset \subseteq R$ and
$S_\emptyset \subseteq S$ of entities without blocking keys, the desired match
result $\mathrm{match}_B(R, S)$ using a blocking key B can be constructed as
the union of three match results. First, the regular matching is
applied for entities with valid blocking keys only ($R - R_\emptyset$ and
$S - S_\emptyset$). The result is then completed with the match results of
the Cartesian product of R with $S_\emptyset$ and $R_\emptyset$ with $S - S_\emptyset$. Such
results can be obtained by employing a constant blocking key
(denoted as $\perp$) so that all entity pairs are considered.
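
The union of the three match results can be sketched as follows. This is a naive in-memory illustration, not the MapReduce implementation; the function name, the use of `None` for a missing blocking key, and the `is_match` callback are assumptions.

```python
from itertools import product


def match_with_missing_keys(R, S, blocking_key, is_match):
    """Combine regular blocking with Cartesian products for key-less entities.

    blocking_key(e) returns a key, or None if e has no valid blocking key.
    Entities must be hashable (e.g., strings or tuples).
    """
    r_keyed = [e for e in R if blocking_key(e) is not None]
    r_none = [e for e in R if blocking_key(e) is None]
    s_keyed = [e for e in S if blocking_key(e) is not None]
    s_none = [e for e in S if blocking_key(e) is None]

    candidates = set()
    # 1) Regular blocking on entities with valid keys only.
    for a, b in product(r_keyed, s_keyed):
        if blocking_key(a) == blocking_key(b):
            candidates.add((a, b))
    # 2) All of R against the key-less entities of S (constant key).
    candidates.update(product(R, s_none))
    # 3) Key-less entities of R against the keyed entities of S.
    candidates.update(product(r_none, s_keyed))
    return {(a, b) for a, b in candidates if is_match(a, b)}
```

The three candidate sets are disjoint by construction, so no pair is compared twice.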

For simplicity we furthermore assume that each input partition
contains only entities of one source (this can be ensured
by Hadoop's MultipleInputs feature). The number of partitions
may be different for each of the two sources.

For illustration, we use the example data of Figure 15(a)
that utilizes the entities A-N and the blocking keys w-z. Each
entity belongs to one of the two sources R and S. Source R
is stored in one partition $\Pi_0$ only whereas entities of S are
distributed among two partitions $\Pi_1$ and $\Pi_2$.

The BDM computation is the same but adds a source tag
to the map output key to identify blocks with the same key in
different sources, i.e., $\Phi_{i,R}$ and $\Phi_{i,S}$. The BDM has the same
structure as for the one-source case but distinguishes between
the two sources for each block (see Figure 15(a)).

A. Block-based Load Balancing 

The BlockSplit strategy for two sources follows the same
scheme as for one source. The main difference is that the
keys are enriched with the entities' source and that each entity
(value) is annotated with its source during the map phase.
Hence, map outputs key-value pairs with
key=(reduce_index.block_index.split.source) and value=(entity).
This allows the reduce phase to easily identify all pairs of entities
from different sources. As in the one-source case, BlockSplit splits
large blocks but restricts the resulting match tasks k.i×j
so that $\Pi_i \in R$ and $\Pi_j \in S$.

Fig. 17. Example PairRange dataflow for 2 sources.



Figure 16 shows the workflow for the example data of
Figure 15(a). The BDM indicates 12 overall pairs so that the
average reduce workload equals 4 pairs. The largest block $\Phi_3$
is therefore subject to a split because it has to process 6 pairs.
The split results in the two match tasks 3.0×1 and 3.0×2.
All match tasks are ordered by the number of pairs: 0.* (4
pairs, reduce_0), 3.0×1 (4 pairs, reduce_1), 2.* (2 pairs, reduce_2),
3.0×2 (2 pairs, reduce_2). The eventual dataflow is shown in
Figure 16. Partitioning is based on the reduce task index only,
for routing all data to the reduce tasks, whereas sorting is done
based on the entire key. The reduce function is called for every
match task k.i×j and compares entities considering only
pairs from different sources. To this end, a reduce task reads
all entities of R and compares each entity of S to all entities
of R.

B. Pair-based Load Balancing 

The PairRange strategy for two sources follows the same
principles as for one source. Entity enumeration is realized per
block and source as in Section V. Pair enumeration is done
for blocks using entities of $\Phi_{i,R}$ and $\Phi_{i,S}$ sharing the same
blocking key. The enumeration scheme is column-oriented but
all cells of the $|\Phi_{i,R}| \times |\Phi_{i,S}|$ matrix will be enumerated. For
two entities $e_R \in \Phi_{i,R}$ and $e_S \in \Phi_{i,S}$ with entity indexes
$x$ and $y$, respectively, the pair index is defined as follows:

$$p_i(x, y) = c(x, y, |\Phi_{i,S}|) + o(i) \quad\text{with}\quad c(x, y, N) = x \cdot N + y \quad\text{and}\quad o(i) = \sum_{k=0}^{i-1} |\Phi_{k,R}| \cdot |\Phi_{k,S}|$$

Figure 15(b) shows
the resulting match-pair enumeration for our running example.
With r = 3, the resulting 12 pairs are divided into three ranges
of size 4. The block with blocking key y need not be
considered because no entity in source S has such a blocking
key.
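
The two-source pair index can be sketched in a few lines. The per-block sizes below are hypothetical (Figure 15 is not reproduced here); they are only chosen so that the blocks contribute 4, 0, 2, and 6 pairs, i.e., 12 pairs overall as in the running example.

```python
def pair_index(x, y, i, sizes_r, sizes_s):
    """p_i(x, y) = x * |Phi_{i,S}| + y + o(i) for entity indexes x (R) and y (S)."""
    # o(i): number of pairs in all preceding blocks.
    offset = sum(sizes_r[k] * sizes_s[k] for k in range(i))
    return x * sizes_s[i] + y + offset


def range_index(pair_idx, total_pairs, r):
    """Index of the range (reduce task) a pair index falls into."""
    comps_per_range = -(-total_pairs // r)  # ceiling division
    return pair_idx // comps_per_range


# Hypothetical block sizes per source (block 1 has no entity in S).
sizes_r = [2, 2, 1, 2]
sizes_s = [2, 0, 2, 3]
total = sum(nr * ns for nr, ns in zip(sizes_r, sizes_s))

# Ranges that the first R entity (x = 0) of block 3 takes part in.
ranges = {range_index(pair_index(0, y, 3, sizes_r, sizes_s), total, 3)
          for y in range(sizes_s[3])}
```

With these sizes the first R entity of block 3 participates in the pairs 6, 7, and 8 and is therefore relevant for the second and third range.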

The map phase identifies all relevant ranges for each entity.
For an entity $e_R \in \Phi_{i,R}$ with index $x$, the ranges of pairs
$p_i(x, 0)$ through $p_i(x, |\Phi_{i,S}| - 1)$ need to be considered, whereas
for $e_S \in \Phi_{i,S}$ with index $y$ the pairs $p_i(0, y)$ through
$p_i(|\Phi_{i,R}| - 1, y)$ are relevant.

For each relevant range, map emits a key-value pair with
key=(range_index.block_index.source.entity_index) and
value=entity. Compared to the one-source case, in addition
to its entity index, each entity (value) is also annotated with
its source (R or S). Partitioning is based on the range index
only, for routing all data to the reduce tasks. Sorting is done
based on the entire key. The reduce function is called for
every block and compares entities as in the one-source case
but only considers pairs from different sources.

Figure 17 illustrates the approach using the example data
of Figure 15(a). For example, entity C ∈ R is the first entity
(index = 0) within block $\Phi_3$. It takes part in ranges $\mathcal{R}_1$ and
$\mathcal{R}_2$ and will therefore be sent to the second and third reduce
task. Hence map emits two keys, (1.3.R.0) and (2.3.R.0),
respectively, for entity C.

Appendix II 
Listings 

In the following, we show the pseudo-code for the two
proposed load balancing strategies and the BDM computation.
Besides the regular output of Algorithm 3 (the BDM itself),
map uses a function additionalOutput that writes each entity
along with its computed blocking key to the distributed file
system. The additional output of the first MR job is read by the
second MR job. By prohibiting the splitting of input files, it is
ensured that the second MR job receives the same partitioning
of the input data as the first job. A map task of the second
job processes exactly one additional output file (produced by
a map task of the first job) and can extract the corresponding
partition index from the file name. With the help of Hadoop's
data locality for map task assignment, it is likely that there is
no redistribution of additional output data.

The map tasks of the second job read the BDM at ini- 
tialization time. It is not required that each map task holds 
the full BDM in memory. For each blocking key that occurs 
in the respective map input partition, it is sufficient to store 
the overall sum of entities in previous map input partitions 
(Algorithm 2 Lines 4-8). Furthermore, it would be possible to 
store the BDM in a distributed storage like HBase to avoid 
memory shortcomings. 

For readability, the pseudo-code refers to the following 
functions: 

• BDM.blockIndex(blockKey) returns the block's index

• BDM.size(blockIndex) returns #entities for a given block

• BDM.size(blockIndex, partitionIndex) returns #entities
for a given block in this partition

• BDM.pairs() returns overall number of entity pairs

• getNextReduceTask() returns the reduce task with the
fewest number of assigned entity comparisons (BlockSplit)

• addCompsToReduceTask(reduceTask, comparisons)
increases number of assigned pairs of the given reduce task
by the given value (BlockSplit)

• match(e1, e2) compares two entities and adds matching
pairs to the final output.



Algorithm 1: Implementation of BlockSplit

map_configure(m, r, partitionIndex)
    // Read BDM from reduce output of Algorithm 3
    BDM ← readBDM();
    matchTasks ← empty map;
    compsPerReduceTask ← BDM.pairs()/r;
    // Match task creation
    for k ← 0 to BDM.numBlocks()-1 do
        comps ← 1/2 · BDM.size(k) · (BDM.size(k)-1);
        if comps ≤ compsPerReduceTask then
            matchTasks.put((k, 0, 0), comps);
        else
            for i ← 0 to m-1 do
                |Φ_k^i| ← BDM.size(k, i);
                for j ← 0 to i do
                    |Φ_k^j| ← BDM.size(k, j);
                    if |Φ_k^i| · |Φ_k^j| > 0 then
                        if i = j then
                            matchTasks.put((k, i, j), 1/2 · |Φ_k^i| · (|Φ_k^i|-1));
                        else
                            matchTasks.put((k, i, j), |Φ_k^i| · |Φ_k^j|);
    // Reduce task assignment
    matchTasks.orderByValueDescending();
    foreach ((k, i, j), comps) ∈ matchTasks do
        reduceTask ← getNextReduceTask();
        matchTasks.put((k, i, j), reduceTask);
        addCompsToReduceTask(reduceTask, comps);

// Operate on additional map output of Algorithm 3
map(k_in=blockingKey, v_in=entity)
    k ← BDM.blockIndex(blockingKey);
    comps ← 1/2 · BDM.size(k) · (BDM.size(k)-1);
    if comps ≤ compsPerReduceTask then
        if comps > 0 then
            reduceTask ← matchTasks.get(k, 0, 0);
            output(k_tmp=reduceTask.k.0.0, v_tmp=(entity, partitionIndex));
    else
        for i ← 0 to m-1 do
            min ← min(partitionIndex, i);
            max ← max(partitionIndex, i);
            reduceTask ← matchTasks.get(k, max, min);
            if reduceTask ≠ null then
                output(k_tmp=reduceTask.k.max.min, v_tmp=(entity, partitionIndex));

// part: Repartition map output by reduceTask
// comp: Sort by blockIndex.i.j (k.i.j)
// group: Group by blockIndex.i.j (k.i.j)
reduce(k_tmp=reduceTask.k.i.j, list(v_tmp)=list((entity, partitionIndex)))
    buffer ← [];
    if i = j then
        // Sub-block matched with itself: compare all pairs once
        foreach (e2, partitionIndex) ∈ list(v_tmp) do
            foreach e1 ∈ buffer do
                match(e1, e2);
            buffer.append(e2);
    else
        // Two sub-blocks: buffer the first one, then compare the second against it
        pair ← list(v_tmp).firstElement();
        buffer.append(pair.first);
        firstPartitionIndex ← pair.second();
        foreach remaining (e2, partitionIndex) ∈ list(v_tmp) do
            if partitionIndex = firstPartitionIndex then
                buffer.append(e2);
            else
                foreach e1 ∈ buffer do
                    match(e1, e2);
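
The match task creation and the greedy reduce task assignment of Algorithm 1 can be sketched in plain Python. This is an in-memory illustration under stated assumptions, not the Hadoop implementation: the BDM is assumed to be a list of per-block rows with one entity count per input partition, and the function names are hypothetical.

```python
def create_match_tasks(bdm, r):
    """bdm[k][i] = number of entities of block k in input partition i."""
    sizes = [sum(row) for row in bdm]
    total_pairs = sum(n * (n - 1) // 2 for n in sizes)
    comps_per_reduce_task = total_pairs / r
    tasks = {}
    for k, row in enumerate(bdm):
        comps = sizes[k] * (sizes[k] - 1) // 2
        if comps <= comps_per_reduce_task:
            tasks[(k, 0, 0)] = comps  # block is small enough: task k.*
        else:
            # Split along input partitions into tasks k.i (i = j) and k.i x j.
            for i, n_i in enumerate(row):
                for j in range(i + 1):
                    n_j = row[j]
                    if n_i * n_j > 0:
                        tasks[(k, i, j)] = (n_i * (n_i - 1) // 2 if i == j
                                            else n_i * n_j)
    return tasks


def assign_reduce_tasks(tasks, r):
    """Greedy assignment: largest task first, always to the least-loaded task."""
    loads = [0] * r
    assignment = {}
    for key, comps in sorted(tasks.items(), key=lambda kv: -kv[1]):
        target = loads.index(min(loads))
        assignment[key] = target
        loads[target] += comps
    return assignment, loads
```

For a hypothetical BDM `[[2, 1], [4, 4], [1, 1]]` with r = 4, only the second block exceeds the average of 8 pairs and is split into three match tasks with 6, 16, and 6 pairs. The 16-pair task still exceeds the average, which shows that the split granularity is bounded by the input partitioning.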
Algorithm 2: Implementation of PairRange

1  map_configure(m, r, partitionIndex)
2      BDM ← readBDM(); // Output of MR job 1
3      compsPerReduceTask ← ⌈BDM.pairs()/r⌉;
4      entityIndex ← []; // Next entity index for each block
5      for i ← 0 to BDM.numBlocks()-1 do
6          entityIndex[i] ← 0;
7          for j ← 0 to partitionIndex-1 do
8              entityIndex[i] ← entityIndex[i] + BDM.size(i, j);
9  // Operate on additional map output of Algorithm 3
10 map(k_in=blockingKey, v_in=entity)
11     ranges ← ∅;
12     i ← BDM.blockIndex(blockingKey);
13     x ← entityIndex[i];
14     N ← BDM.size(i);
15     R_min ← rangeIndex(0, max(x, 1), N, i);
16     R_max ← rangeIndex(min(x, N-2), N-1, N, i);
17     ranges ← {R_min} ∪ {R_max};
18     if ranges.size > 1 then // pairs of this entity span several ranges
19         for k ← 1 to x-1 do
20             ranges ← ranges ∪ {rangeIndex(k, x, N, i)};
21         R_med ← rangeIndex(min(x, N-2), min(x+1, N-1), N, i);
22         for k ← R_med to R_max-1 do
23             ranges ← ranges ∪ {k};
24     foreach r ∈ ranges do
25         output(k_tmp=r.i.x, v_tmp=(entity, x));
26     entityIndex[i] ← entityIndex[i]+1;
27 reduce_configure(m, r)
28     BDM ← readBDM();
29     compsPerReduceTask ← ⌈BDM.pairs()/r⌉;
30 // Repartition map output by range index (r), sort by
31 // blockIndex.entityIndex (i.x), group by blockIndex (i)
32 reduce(k_tmp=r.i.x, list(v_tmp)=list((entity, x)))
33     N ← BDM.size(i);
34     buffer ← [];
35     foreach (e2, x2) ∈ list(v_tmp) do
36         foreach (e1, x1) ∈ buffer do
37             k ← rangeIndex(x1, x2, N, i);
38             if k = r then
39                 match(e1, e2); // Comparison; output matches
40             else if k > r then
41                 break; // later columns of this row lie in later ranges
42         buffer.append((e2, x2));
43 rangeIndex(col, row, blockSize, blockIndex)
44     cellIndex ← 0.5 · col · (2·blockSize-col-3) + row - 1;
45     pairIndex ← cellIndex + pairIndexOffset(blockIndex);
46     return ⌊pairIndex/compsPerReduceTask⌋;
47 pairIndexOffset(blockIndex)
48     sum ← 0;
49     for k ← 0 to blockIndex-1 do
50         sum ← BDM.size(k) · (BDM.size(k)-1) + sum;
51     return 0.5 · sum;
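
The column-wise pair enumeration behind rangeIndex can be checked with a small sketch. The helper names are hypothetical, and `relevant_ranges` is a brute-force reference that inspects every pair of an entity rather than the shortcut used in the map function of Algorithm 2.

```python
import math


def cell_index(col, row, n):
    """Column-wise index of the pair (col, row), col < row, in a block of n entities."""
    # col * (2n - col - 3) is always even, so integer division is exact.
    return col * (2 * n - col - 3) // 2 + row - 1


def make_range_index(block_sizes, r):
    """Build a rangeIndex function for the given block sizes and r ranges."""
    offsets, total = [], 0
    for n in block_sizes:
        offsets.append(total)          # pairs in all preceding blocks
        total += n * (n - 1) // 2
    comps_per_range = max(1, math.ceil(total / r))

    def range_index(col, row, i):
        return (cell_index(col, row, block_sizes[i]) + offsets[i]) // comps_per_range
    return range_index


def relevant_ranges(x, i, block_sizes, range_index):
    """All ranges that contain at least one pair of entity x of block i."""
    n = block_sizes[i]
    as_row = {range_index(col, x, i) for col in range(x)}
    as_col = {range_index(x, row, i) for row in range(x + 1, n)}
    return as_row | as_col
```

For a single block of five entities and r = 3, the ten pairs are split into ranges of size 4, 4, and 2. The first entity is needed in the first range only, whereas the last entity takes part in all three ranges.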



Algorithm 3: Computation of the BDM

map_configure(m, r, partitionIndex)
    // store partitionIndex

map(k_in=unused, v_in=entity)
    blockingKey ← computeKey(entity);
    additionalOutput(k=blockingKey, v=entity); // to DFS
    output(k_tmp=blockingKey.partitionIndex, v_tmp=1);

// Repartition map output by blockingKey, sort by
// blockingKey.partitionIndex, group by blockingKey.partitionIndex
reduce(k_tmp=blockingKey.partitionIndex, list(v_tmp)=list(1))
    sum ← 0;
    foreach number in list(v_tmp) do
        sum ← sum + number;
    out ← blockingKey + "," + partitionIndex + "," + sum;
    output(k_out=unused, v_out=out);
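
The BDM computation amounts to counting entities per blocking key and input partition. The following in-memory sketch mimics that counting without Hadoop; the function name and the representation of partitions as nested lists are assumptions.

```python
from collections import Counter


def compute_bdm(partitions, blocking_key):
    """Return a counter mapping (blocking key, partition index) to #entities.

    partitions: one list of entities per map input partition.
    """
    counts = Counter()
    for partition_index, entities in enumerate(partitions):
        for entity in entities:
            # map emits ((key, partition), 1); reduce sums the ones per key.
            counts[(blocking_key(entity), partition_index)] += 1
    return counts


# Usage with the default blocking key of the evaluation (first three letters).
partitions = [["mapreduce a", "mapping b", "hadoop c"], ["map d", "hadoop e"]]
bdm = compute_bdm(partitions, lambda title: title[:3])
```

Each entry of the resulting counter corresponds to one output line "blockingKey,partitionIndex,sum" of the reduce function.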



